[BugFix] Fix bugs in /v1/abort_requests interface from PR(#6992) #7176
qwes5s5 wants to merge 3 commits into PaddlePaddle:develop
Thanks for your contribution!
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7176 +/- ##
==========================================
Coverage ? 73.85%
==========================================
Files ? 383
Lines ? 53641
Branches ? 8412
==========================================
Hits ? 39614
Misses ? 11329
Partials ? 2698
81f06aa to 81239a0
Pull request overview
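This PR fixes issues in /v1/abort_requests introduced by PR(#6992). It prevents 500 errors and OpenAI-protocol finish_reason validation failures in multimodal and Completion scenarios, and makes the abort cleanup path more robust.
Changes:
- Extend the finish_reason enum in the OpenAI protocol responses to include "abort" so that validation passes.
- Improve the abort wait/cleanup logic: filter out requests that have already finished, clean up the related sets, and clear the abort-related sets when resources are reclaimed, so that leftover entries cannot cause errors.
- Update/fix the related unit tests to match the new filtering logic.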
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
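| tests/engine/test_common_engine.py | Adjusts the _wait_abort_complete test cases, filling in the requests dicts to match the new filtering logic |
| fastdeploy/entrypoints/openai/protocol.py | Adds "abort" to finish_reason, fixing the protocol validation failure |
| fastdeploy/engine/sched/resource_manager_v1.py | Abort/finish reclaim paths use discard and additionally clean up the abort sets |
| fastdeploy/engine/common_engine.py | Strengthens _wait_abort_complete: filters out already-finished targets and cleans up the abort sets, reducing the risk of hangs and 500s |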
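
The filtering and waiting behaviour described for _wait_abort_complete can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function names `filter_live_targets` and `wait_until_aborted`, the `stall_timeout` and `poll_interval` parameters, and the use of `time.monotonic` are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable, Set


def filter_live_targets(
    target_req_ids: Iterable[str],
    resource_req_ids: Iterable[str],
    scheduler_req_ids: Iterable[str],
) -> Set[str]:
    """Keep only the abort targets that some component still tracks.

    Hypothetical helper: IDs that already finished are dropped so the
    caller never waits on a request that no longer exists.
    """
    known = set(resource_req_ids) | set(scheduler_req_ids)
    return set(target_req_ids) & known


def wait_until_aborted(
    targets: Set[str],
    get_aborting: Callable[[], Set[str]],
    stall_timeout: float = 1.0,
    poll_interval: float = 0.01,
) -> Set[str]:
    """Poll until no target is still aborting, or until progress stalls.

    Returns the IDs that were still aborting when the wait ended.
    """
    remaining = targets & get_aborting()
    prev_count = len(remaining)
    last_progress = time.monotonic()
    while remaining:
        if time.monotonic() - last_progress > stall_timeout:
            break  # no progress for too long; give up instead of hanging
        time.sleep(poll_interval)
        remaining = targets & get_aborting()
        if len(remaining) < prev_count:
            # At least one request left the aborting set: reset the clock.
            prev_count = len(remaining)
            last_progress = time.monotonic()
    return remaining
```

With this shape, a target ID that is absent from both request tables is dropped before the wait starts, so the loop cannot stall on it.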
    target_set = set(target_req_ids)
    target_set = target_set & (set(self.resource_manager.requests.keys()) | set(self.scheduler.requests.keys()))
    prev_remaining_count = len(target_set)
    last_progress_time = time.time()
    remaining = target_set & self.resource_manager.get_reqs_in_aborting()
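Here, set(self.resource_manager.requests.keys()) and set(self.scheduler.requests.keys()) are built by iterating over the dicts without holding the corresponding locks (ResourceManagerV1.lock, LocalScheduler.mutex). Both dicts are modified by other threads at runtime (scheduler/local_scheduler.py explicitly protects requests with mutex), so this can raise RuntimeError: dictionary changed size during iteration and make the abort flow return 500 again. Consider obtaining a snapshot of the request IDs through a thread-safe interface (for example, add a get_request_ids method to ResourceManager/Scheduler that takes the lock internally), or guard the reads here with the respective locks.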
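
One way to read the suggestion above is a lock-guarded snapshot method. The sketch below is only an illustration under assumptions: `RequestRegistry` is a hypothetical stand-in for whichever component owns a requests dict, and `get_request_ids` is the method name proposed in the comment, not an existing FastDeploy API.

```python
import threading
from typing import Dict, Set


class RequestRegistry:
    """Hypothetical stand-in for a component that owns a `requests` dict."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._requests: Dict[str, object] = {}

    def add(self, req_id: str, request: object) -> None:
        with self._lock:
            self._requests[req_id] = request

    def remove(self, req_id: str) -> None:
        with self._lock:
            self._requests.pop(req_id, None)

    def get_request_ids(self) -> Set[str]:
        # Copy the keys while holding the lock, so callers get a stable
        # snapshot and never iterate the live dict themselves.
        with self._lock:
            return set(self._requests)


def churn(registry: RequestRegistry, count: int) -> None:
    """Writer used for the demo: repeatedly adds and removes requests."""
    for i in range(count):
        registry.add(f"req-{i}", object())
        registry.remove(f"req-{i}")


registry = RequestRegistry()
registry.add("req-live", object())
writer = threading.Thread(target=churn, args=(registry, 2000))
writer.start()
snapshots = [registry.get_request_ids() for _ in range(200)]
writer.join()
```

Because every reader goes through `get_request_ids`, the iteration happens entirely under the lock, and the returned set is independent of later mutations.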
      logprobs: Optional[LogProbs] = None
      draft_logprobs: Optional[LogProbs] = None
      prompt_logprobs: Optional[PromptLogprobs] = None
    - finish_reason: Optional[Literal["stop", "length", "tool_calls", "recover_stop"]]
    + finish_reason: Optional[Literal["stop", "length", "tool_calls", "recover_stop", "abort"]]
      speculate_metrics: Optional[SpeculateMetrics] = None
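The Modifications section of the PR description mentions only the finish_reason change, but this PR also changes the abort wait/cleanup logic (common_engine/resource_manager) and adds unit tests. Consider updating the PR description to explain the reason for and scope of these additional changes, to make review and later tracing easier.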
fastdeploy-bot left a comment
🤖 AI Code Review | 2026-04-10
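📋 Review Summary
PR overview: Fixes bugs in the abort_requests interface introduced by PR#6992, addressing finish_reason validation failures and 500 errors
Scope of changes: fastdeploy/engine/, fastdeploy/entrypoints/openai/, tests/engine/
Impact tags: [BugFix] [Engine] [APIServer]
📝 PR Convention Check
The PR title contains a valid tag, [BugFix], and the Motivation and Modifications descriptions are clear and follow the conventions.
Issues
No blocking issues found.
Overall Assessment
This PR correctly fixes the issues introduced by PR#6992:
- protocol.py: adds the "abort" value to the Literal type of every finish_reason field, fixing the validation failure
- resource_manager_v1.py: changes remove to discard to avoid KeyError, and correctly cleans up the abort sets in finish_requests
- common_engine.py: adds handling for already-finished requests, avoiding potential problems during abort processing
- test_common_engine.py: correctly updates the test cases to fit the new logic
The changes are logically correct and fix the known issues.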
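
The remove-to-discard change relies on standard Python set semantics: `set.remove` raises KeyError for a missing element, while `set.discard` is a no-op. A small illustration follows; the `cleanup_abort_state` helper and the set name `aborting` are made up for the example and are not taken from resource_manager_v1.py.

```python
from typing import Set


def cleanup_abort_state(aborting: Set[str], req_id: str) -> None:
    """Idempotent cleanup: safe to call even if req_id is already gone."""
    aborting.discard(req_id)


aborting = {"req-1", "req-2"}

# Two code paths (for example abort and finish) may both try to clean up
# the same request; with discard the second call is harmless.
cleanup_abort_state(aborting, "req-1")
cleanup_abort_state(aborting, "req-1")

# The same double cleanup with remove() fails on the second attempt.
try:
    aborting.remove("req-1")
    remove_raised = False
except KeyError:
    remove_raised = True
```

Making the cleanup idempotent means the order in which the abort and finish paths run no longer matters for these sets.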
Motivation
It was discovered that PR(#6992) introduced bugs that may trigger 500 errors or finish_reason validation failures when using multimodal or completions interfaces.
Modifications
Modified finish_reason to include the "abort" value.
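
To illustrate why the extra value matters, the following sketch checks a finish_reason against the old and new Literal types using only the standard library. The two Literal definitions mirror the values shown in the diff; the `check_finish_reason` helper is hypothetical and stands in for whatever validation protocol.py actually performs, which is not shown here.

```python
from typing import Any, Literal, Optional, get_args

# Allowed values before and after the change, as shown in the diff.
FinishReasonOld = Literal["stop", "length", "tool_calls", "recover_stop"]
FinishReasonNew = Literal["stop", "length", "tool_calls", "recover_stop", "abort"]


def check_finish_reason(value: Optional[str], allowed: Any) -> Optional[str]:
    """Hypothetical validator: accept None or any value listed in `allowed`."""
    if value is not None and value not in get_args(allowed):
        raise ValueError(f"invalid finish_reason: {value!r}")
    return value


# An aborted request reports finish_reason="abort".
accepted = check_finish_reason("abort", FinishReasonNew)

try:
    check_finish_reason("abort", FinishReasonOld)
    old_type_rejects = False
except ValueError:
    old_type_rejects = True
```

Under the old type, a response for an aborted request fails validation before it can be returned; widening the Literal lets the same response pass.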
Usage or Command
Accuracy Tests
Checklist
- Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If the PR targets the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.